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Abstract 

Principal Component Analysis (PCA) is a commonly used tool for dimension 

reduction in analyzing high dimensional data; Multilinear Principal Component 
00 ■ 

^SJ ' Analysis (MPCA) has the potential to serve the similar function for analyzing ten- 



sor structure data. MPCA and other tensor decomposition methods have been 
proved effective to reduce the dimensions for both real data analyses and simulation 
studies (Ye, 2005; Lu, Plataniotis and Venetsanopoulos, 2008; Kolda and Bader, 
---^ 2009; Li, Kim and Altman, 2010). In this paper, we investigate MPCA's statistical 

X 

^ ■ properties and provide explanations for its advantages. Conventional PCA, vec- 

torizing the tensor data, may lead to inefficient and unstable prediction due to its 
extremely large dimensionality. On the other hand, MPCA, trying to preserve the 
data structure, searches for low-dimensional multilinear projections and decreases 
the dimensionality efficiently. The asymptotic theories for order-two MPCA, in- 
cluding asymptotic distributions for principal components, associated projections 
and the explained variance, are developed. Finally, MPCA is shown to improve 
conventional PCA on analyzing the Olivetti Faces data set, by constructing more 
module oriented basis in reconstructing the test faces. 
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1 Introduction 

Dimension reduction is a key step for high dimensional data analysis. Principal compo- 
nent analysis (PCA) is probably the most commonly used method for dimension reduction. 
Given n observations on m variables, PCA calculates the m x m covariance matrix and 
solves the eigenvalue decomposition problem for the covariance matrix. The goal is to 
choose a smaller set of eigenvectors as a new coordinate system so that the newly trans- 
formed variables can retain the most data variation. This PCA approach has been widely 
applied in many scientific fields for dimension reduction and compact data representation 
(Jolliffe, 2002), where the collected data are organized in an n x m design matrix with 
each row representing an observation and each column a variable. 

When data are tensor objects, traditional analysis vectorizes each of the tensor objects 
into a long vector and arranges these vectorized objects in a design matrix form. Subse- 
quent analysis is followed in the usual way. Nevertheless, this approach usually produces 
a large number of variables, where the available sample size is relatively small, and many 
existing statistical methods fail to apply. For a typical example, like the Olivetti Faces 
data set to be used in an experimental study later, there are 400 images each with 64 x 64 
pixels. Vectorizing each image leads to a design matrix of size 400 x 4096, which the 
variable dimension m largely exceeds the sample size n. 

One strategy to overcome this difficulty is to take advantage of the natural tensor 
structure of the data. Singular value decomposition (SVD) is an example. Given a p x q 
matrix X which can be treated as an order-two tensor, SVD can decompose two directional 
spaces simultaneously: X = USV^ = Yl^=i^i^i'^I ^ where U = [ui,...,Up] E ^p^p and 
V = [f 1, Vq] G 9^'?^'^ are, respectively, the left and right singular vectors, S* is a diagonal 
matrix of size p x q with diagonal elements {si, The dimension can be reduced 
when the index i is properly truncated. De Lathauwer, De Moor and Vandewalle (2000a) 
then generalized the SVD to high-order SVD (HOSVD) for a given A^^'^-order tensor 
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object A G ^^^^'"^^'^ . Further, they formulated the problem of "best rank-(i?i, ■ ■ ■ ,Rn) 
approximation of higher-order tensors" in the least-squares sense, and discussed many 
algorithms to achieve this task (De Lathauwer, De Moor and Vandewalle, 2000b). 

Later, Yang et al. (2004) proposed two-dimensional PCA (2DPCA) for analyzing 
image data, which are order-two tensors. An improved two-directional two-dimensional 
PCA ((2D)'PCA) was developed in Zhang and Zhou (2005), which was shown to perform 
better than 2DPCA through simulation studies. Ye (2005) formulated the problem of gen- 
eralized low rank approximation of matrices, which can be treated as a sample extension 
of the best rank-(i?i, R2) approximation for order-two tensors in De Lathauwer, De Moor 
and Vandewalle (2000b). Lu, Plataniotis and Venetsanopoulos (2008) further generalized 
the work of Ye (2005) and proposed multilinear PCA (MPCA) for tensor objects of arbi- 
trary orders. There are other tensor decomposition methods for dimension reduction. For 
instance, Kolda and Bader (2009) provided a general overview of current development of 
tensor decomposition methods for unsupervised learning, their applications, and available 
softwares; Li, Kim and Altman. (2010) considered the tensor decomposition methods for 
supervised learning such as regression and classification. 

Similar to conventional PCA, the goal of MPCA is to look for low- dimensional mul- 
tilinear projection for tensor objects that captures the most data variation. Back to the 
example of Olivetti Faces, one eigenvector in conventional PCA creates an image basis ele- 
ment that contains 4095 free parameters. By contrast, one image basis element in MPCA 
or (2D)^PCA, which involves the Kronecker product of a column vector and a row vector, 
contains 126 free parameters. From the viewpoint of the number of parameters required to 
specify one basis element, MPCA is expected to perform better than conventional PCA, 
when the sample size is small to moderate, like this Olivetti Faces example. Compared 
to (2D)^PCA, MPCA has the advantage of capturing more data variation by the chosen 
image basis, because of its specific criterion. MPCA has been successfully applied in real 
data analysis and checked by simulations (Ye, 2005; Lu et al., 2008). Yet, to our best 
knowledge, there is neither statistical justification nor asymptotic study for MPCA. 

In this paper, we try to establish some relevant properties of order-two MPCA from a 
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statistical point of view. Our study is based on the following model: 

X = fx + AoUBl + e, 



(1) 



where /i G 'W^'^ is the mean parameter of X, Aq G W^^^^ and Bq G 3?'^^''° with po < p and 
qo < q are non-random basis matrices, f/ G 3^^^'"^'?o is a random coordinate matrix with 
E\U] = and a strictly positive definite covariance matrix Cov(vec(f/)) = T G 3[^™o^™o, 
where mo = PoQ'o and vec(-) is the operator that stacks the columns of a tensor into a 
long vector. The error term e G ^'^^'^ is a random matrix independent of U and with 
£'[£:] =0 and Cov(vec(£)) = a'^Im, where m = pq. Under model (1) which characterizes 
the tensor structure of X, we justify the validity of MPCA. Asymptotic properties of 
MPCA are rigorously developed, including asymptotic distributions for principal compo- 
nents, associated projections and the explained variance. It is also shown that MPCA is 
asymptotically more efficient than (2D)^PCA in estimating the target dimension reduc- 
tion subspace. Furthermore, a test of dimensionality is developed, based on the derived 
asymptotic results. 

This paper is organized as follows. Section 2 presents some properties of the estimation 
for the target subspace and a test for its dimensionality. The relations between MPCA 
and both conventional PCA and (2D)^PCA are also discussed in this section. In section 3, 
the asymptotic theory of MPCA is developed. In section 4, the performance of MPCA 
and its comparison with conventional PCA is demonstrated by analyzing the Olivetti Faces 
data set. The paper ends with a brief discussion. Technical proofs of main results are 
deferred to the Appendix. 

2 MPCA 

MPCA, as a dimension reduction algorithm, is originally designed to search basis matrices 
{A, B} and coordinate matrices f/j's that best approximate the observed data as AUiB^ 
for i = 1, . . . ,n. Although many simulation studies and real data analyses in literature 
support the usage of MPCA and multilinear tensor decomposition (Ye, 2005; Lu et al., 
2008; Kolda and Bader, 2009; Li et al., 2010), there is no theoretical study from the 
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statistical point of view. Let ® be the Kronecker product. Then, there is an equivalent 
formula for model (1) 



X = n + AqUBq +e <^ vec(X - //) = {Bq Ao)yec{U) + yec{e) 



(2) 



by the fact that vec^AoUBj) = {Bq (g) y4o)vec(f/). Without loss of generality, we may as- 
sume that Aq and Bq are orthogonal matrices, i.e., AqAq = Ip^ and BqBq = Iq^. Model (1) 
thus ensures that, without considering the error term e, the columns and rows of {X — fi) 
belong to span(y4o) and span(i?o), respectively, and vec(X — /i) belongs to the subspace 
span(i?o) 8' span(/lo) = span(i?o ® Aq). It is then reasonable to estimate span(i?o ® ^o) 
for follow-up analysis such as data compression, pattern recognition, regression analysis, 
etc. In this section, we show that, under model (1), MPCA actually attempts to extract 
a basis pair {A,B} targeting the subspace span(i?o ® Aq). Proposition 2.2 below proves 
the existence of a solution pair {A,B}. Proposition 2.5 summarizes that the inclusion 
relation between span(y4) (resp., span(i?)) and the target dimension reduction subspace 
span(ylo) (resp., span(i?o)), depends on the size comparison between the specified dimen- 
sionality p (resp., q) and Pq (resp., go)- Recognizing the important roles of p and q, we 
construct a hypothesis test for choosing p and q. These works justify the usage of MPCA 
in extracting the relevant basis for subsequent analysis, provided the data has a natural 
tensor structure. 

2.1 Estimation 

Let {Xi}'^^i be the collected data set which are assumed to be random copies of a random 
matrix X G 9?^^'^. MPCA aims to extract the basis pair that best approximate {Xj}"^^ 
while preserving the tensor structure of them. In particular, for a pre-specified dimen- 
sionality (p, g). Ye (2005) proposed a criterion to find A G Op^p,B G Cg,^, and {Ui}^^^ 
that minimize 




(3) 



i=l 



where X = ^ XlILi -^i sample mean matrix, || ■ ||f is the Frobenius norm of a matrix, 

and (9^ I is the collection of all orthogonal matrices M of size i x i such that M'^M = I^. 
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Note that the objective function (3) can be expressed as 

n 

- V ||vec(X, -X)-{B0 A)yec{Ui) \\l (4) 
n ^ — ^ 

i=l 

If we replace {B^A) by F in (4), the minimization problem then becomes the conventional 
PCA. From this viewpoint, MPCA can be treated as a constrained PCA with the tensor 
constraint F = (B^A), where A G Op^p and B G Oq^q. The following theorem established 
in Ye (2005) characterizes some useful properties of the solutions of the minimization 
problem (3). In the rest of discussion, Pm denotes the orthogonal projection matrix onto 
span(M) and Qm = I — Pm- 

Theorem 2.1. (Ye, 2005) Let A,B,{Ui}'^^-^ constitutes a minimizer for (3) under the 
dimensionality [p, q) . Then, 

(a) U, = A'^{Xi-X)B. 

(h) {A, B} is the maximizer of ^ ^"^^ \\A^{Xi - X)B\\j,. 

(c) A consists of the leading p eigenvectors of ^ X]r=i(^« ~ ^)PB{-^i ~ ^Y" > ^''^^ ^ 
consists of the leading q eigenvectors of ^ Yl^=ii-^i ~ Pxi-^^ ^ 

Similarly, we can define a population version of (3): -ElK^ — fi) — AUB'^W'jp, and the 
corresponding minimizer should follow Theorem 2.1 such that the minimizer over A G Op^p 
and B G Oq^g, is equivalent to the maximizer of the maximization problem: 

argmax E\\A^{X - fi)B\\l = argmax trace {{B ^ Afj:{B A)} , (5) 

where S = Cov(vec(X)). The following proposition gives the existence of the solution. 

Proposition 2.2. For a fixed but arbitrary positive semi- definite matrix S of size pq x pq, 
solution(s) to the maximization problem (5) exists. 

Note that we do not need the model assumption (1) for Proposition 2.2. Also note 
that Proposition 2.2 applies to problem (3) as well by replacing S with its sample estimate 
Sni the sample covariance matrix of {vec(Xj)}"^]^, and by rephrasing the maximization 
problem into the equivalent minimization problem. With the existence of the maximizer 
in (5) we can formally define the tensor principal components and the MPCA subspace. 
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Definition 2.3. For a pre-specified dimensionality {p, q), let {A, B} he the unique solution 
to the maximization problem (5), where A and B can be expressed in their columns as 
A = [ai, . . . ,ap\ and B = [bi, . . . ,bq]. We call {bj ^ ai : 1 < i < p, 1 < j < q} the tensor 
principal components, and span(i? ® A) the MFC A subspace of dimensionality [p, q). 

Using similar arguments as in Tlieorem 2.1 (c), we liave tliat A and B consist of 
the leading p and q eigenvectors of E[{X — ii)Pb{X — /i)^] and E[{X — fL)^PA{X — /i)], 
respectively. Since 

q q 
E[{X - ^)Pb{X - ^f] = J2 E[{X - f^){b,bJ){X - = Y,{b, ® I.f^ib, ® I,), (6) 

j=i i=i 
p p 
E[{X - fifPAiX -l^)]=J2 ^[(^ - ^^fi(^^(^J)iX - /i)] = ® Ipf^{a^ ® /p), (7) 

i=l i=l 

equivalently, {A, B} consist of the leading solutions of the system of stationary equations 



^{bj O Ip)'^T.{bj (g) Jp) j at = XiQi, i = !,■■■ ,p, 
j=i J 
P \ 
^(/g O ai)^S(/g O ai) bj = ^jbj, j = !,■■■ , 



j=i 

over A G Op^p and B G Og^q, where the ordering is determined by the corresponding 
eigenvalues Ai > ■ ■ ■ > Ap > and > ■ ■ ■ > C,q > 0. 

Remark 2.4. Obviously Xi's, ai's, ^j's and bj's depend on S. Besides such dependence, 
they also depend on the dimensionality [p, q) . A more precise notation for them should be 
Xi(T,,p,q), ai(T,,p,q), C,j(T,,p,q) and bj(T,,p,q). However, for notation simplicity, we use 
\i, ai, ^j and bj, unless we want to emphasize on their dependence on g). 

From Remark 2.4, for any fixed {p,q), we could define the sample analogues {A^B}, 
Aj's, and ^j^s by replacing S with the sample covariance matrix Sn- In the rest of the dis- 
cussion, with pre-specified dimensionality {p,q), we denote the solution of (5) by {A,B} 
and the population tensor principal components by {B ® A), and the corresponding sam- 
ple analogues by {A, B} and {B®A). Finding principal components in conventional PCA 
is equivalent to an eigenvalue-problem. However, there is no explicit solution of {A, B} 
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for MPCA; therefore, an algorithm was proposed. The GLRAM algorithm of Ye (2005) 
to obtain {A, B} is summarized below. 

GLRAM (Ye, 2005): Given a random initial A^^^ G Op^p. For A; = 1, 2, ■ ■ ■ , 

1. Obtain the maximizer 3^''+^^ = argmaxsgo^^. ^ 'Z'Li 11^^''^^ - X)B\\l. 

2. Obtain the maximizer = argmax^eOpxp ^ Er=i W^^i^i - X)B^''+^^\\l. 

3. Repeat Steps 1-2 until there is no significant difference between ^ Y17=i II ^'''^^"^(^i ~" 
X)fi«|||andi^"^J|A('=+i)^(X,-X)5('=+i)||2,. Output {A 5} = {^('=+1), 

For any fixed A^'^^ or B^''~^^\ the optimization problems in Steps 1 and 2 are the usual 
eigenvalue-problems of sizes p and q, respectively. Hence, A^^"*"^) and can be easily 

obtained. Moreover, the algorithm ensures the quantity ^Yl^=i ||^*''^''^(-^j ~ X)B^''^\\'jp 
to be monotonically increasing as k increases and, hence, the solution must exist since 
^ Er=i \\A^''^^{X^-X)B^^^\\l is bounded above by i 11^* -^IIf- Because GLRAM 
can only find a local maximum (depends on the chosen random initial A^'^^), multiple 
random initials are suggested by Ye (2005) to ensure the global maximum. In contrast to 
this suggestion, we propose to use the leading p eigenvectors of ^ — X){Xi — X)'^ 

as an initial of A^. 

We observe that hierarchical nesting structure may not exist for MPCA. Precisely, if 
{p',q') < {p,q) with the corresponding solution pairs {A',B'} and {A,B}, respectively, 
there is no guarantee that span(y4') C span(yl), nor span(i?') C span(i?). In the popula- 
tion level, however, there certainly exist relationships between the target subspaces and 
the MPCA subspaces prescribed by the optimization problem (5). 

Proposition 2.5. Assume model (1) and let {A, 5} he the solution pair to the maximiza- 
tion problem (5) under dimensionality {p,q)- 

(a) If p > po and q > qo, then span(y4) D span(y4o) and span(i?) 3 span(_Bo)- 

(b) If p < po and q > qo, then span(A) C span(y4o) and span(i?) D span(i?o)- 

(c) If p > Po and q < qo, then span(A) D span(Ao) and span(i?) C span(i?o)- 
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(d) If p < po and q < qo, then span(A) C span(ylo) and span(i?) C span(i?o)- 

Even though there is no general hierarchical nesting structure for MPCA subspaces, 
Proposition 2.5 ensures the existence of a specific nesting structure, which the extracted 
MPCA subspace is a proper subspace of the target subspace if the dimension is under- 
specified, and contains the target subspace if the dimension is over-specified. It also 
implies that MPCA indeed searches the true target subspace span(i?o Aq) when [p, q) = 
[Pq, go) is correctly specified. As a result, these arguments provide a justification of using 
span(i? (g) A) in the sample level for subsequent statistical analysis. 

2.2 Connection with (2D) PCA and conventional PCA 

The (2D)^PCA is another method to extract basis for tensor objects. For a given di- 
mensionality {p,q), the population (2D) PCA components A* = [ and B* = 
[bl, ■ ■ ■ , 6|] are defined to be the leading p and q eigenvectors of E[{X — n){X — fi)'^] 
and E[{X — ^)'^{X — yu)] with the corresponding eigenvalues {A* : 1 < « < p} and 
{Cj : 1 < j < Q'}- The sample analogues, denoted by A*, B*, A*, and i,* are similarly 
defined to be the leading eigenvectors and eigenvalues of ^ ^"=i(^j ~ — and 
n X]r=i("^« ~" "~ The following proposition states a connection between the 
(2D)¥CA and MPCA in the population level. 

Proposition 2.6. Assume model (1) and that {A* : 1 < i < po} o-'^^d {^* : 1 < j < go} 
are simple roots. 

(a) If q > qo, then MPCA and (2D)'^PCA share the same leading (poAp) eigenvectors, 
i.e., tti = a* for i = 1, ■ ■ ■ , {po A p) . Moreover, A* — Aj = (g — q)a^ for i = 1, . . . ,p. 

(b) If p> Po, then MPCA and (2D)'^PCA share the same leading {qo Ag) eigenvectors, 
i. e., bj = b* for ? = 1, ■ ■ ■ , (go A g) . Moreover, ^* — = (p — p)<j'^ for j = 1, . . . , g. 

When the dimension [p, g) is adequate. Proposition 2.6 implies that (2D)^PCA and 
MPCA, in the population level, actually target the same subspace span(5o ® ^o) under 
model (1). However, there is no guarantee that the extracted bases {A*, B*} of (2D)^PCA 
also maximize the sample version of (5). Though, under the setting of Proposition 2.6, 
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E[[X — ij)Pb{X — nY] and E[{X — fi){X — /i)'^] have the same leading eigenvectors, we 
expect an efficiency gain in using E[{X — fi)PB{X — fi)'^], since it is less noise-contaminated 
than E[{X — fi){X — yu)^]. A rigorous proof of efficiency gain is provided in Section 3. 

Remark 2.7. From Proposition 2.6, it is suggested to select {p,q) through {2D)^ PCA, 
since the dimension of Aq and Bq are not known before being estimated. A formal statis- 
tical test is provided in Section 2.3. 

There is also a connection between MPCA and conventional PCA. Under model (1), 
without considering the random noise e, vec(X — /i) belongs to span(i?o ® ^o)? which is 
the target subspace of MPCA. Observe also that 

E = {B^®AQ)T{B^®A^f + aH^ 

= {Bo^Ao){T + a^I^,){Bo^Aof + a^QBo^Ao, (8) 

where Qbo'^Ao = Qbq® Paq + Pbq ® Qao + Qbq ® Qao is the projection matrix onto the 
complement of span(i?o ® ^o)- It should be noted that the matrix T is not necessarily a 
diagonal matrix. Hence, (Bq^Aq) is not the same with the conventional PCA components 
in general. If we further diagonalize T = GDG^ with D being a diagonal matrix of size 
mo X mo, we have the following factorization: 

s = [r. Fx] 

where F = [Bq ® Aq)G and F^ is an orthonormal basis for the orthogonal complement 
of span(F). Consequently, the conventional PCA uses F = [Bq ® Aq)G as coordinate 
system for a compressed representation for vec(X — yu), while the MPCA uses {Bq ® Aq). 
Notice that span(i?o ® ^o) = span(F) provided that T in (8) is of full rank. In summary, 
MPCA and conventional PCA use the same subspace for compressed data representation. 
However, MPCA requires less parameters (see the following remark) to specify the low- 
dimensional subspace than the conventional approach. 

Remark 2.8. The number of free parameters required for MPCA is pop — ^po(Po + 1) + 
QoQ ^ 1^0 (^o + 1); which is relatively small in contrast with the number of free parameters 
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D + a^L 



mo 



m—mo 



[r, rj^. 



required for conventional PC A: poqopq—jpoqo^poqQ + l). It is the adoption of (Bq^Aq) for 
the sake of parsimony, which is one of the purposes of using MPCA. The following table 
gives the numbers of parameters needed to specify an orthonormal basis for a subspace of 
dimensionality po x gg within a space of dimensionality p x q = 100. We fix {p, q,Po) = 
(10, 10, 5) and let go vary. 



qo 1 


2 


3 


4 


5 


MPCA 44 


52 


59 


65 


70 


PCA 485 


945 


1380 


1790 


2175 



Table 1: Numbers of required free parameters at {p,q,Po) = (10, 10,5). 

We remind the reader that there is no obvious ordering relationship between the MPCA 
components and conventional PCA components. This can be seen in a simple example 
when T = Cov(vec(f/)) = diag(vec(C)), where C is a matrix with Cij = Var(f/jj). For 
the case of uncorrelated f/jj's, T is diagonal, and hence, the conventional PCA and the 
MPCA share the same eigenvectors. The leading p^q^ eigenvalues of the conventional 
PCA are {Cij + o"^ : 1 < i < po? 1 ^ J ^ ^o}) which have a natural ordering depending on 
the values of Ci/s. On the other hand, the leading eigenvalues of MPCA at {po,qo) are 
derived to be Ci. = Xljli ^ij^ i = !,■■■ ,po, and C.j = ^iji J = " " " > ?o, where the 
ordering depends on the column sums and row sums of Cjj's. Therefore, even if we pick 
and hj from leading eigenvectors of A and 5, there is no guarantee that, when paired 
together, bj ® ai is on the top list of leading eigenvectors of the conventional PCA. 

2.3 Selection of dimensionality 

This section is devoted to the selection of the dimensionality (p, g). Similar to the con- 
ventional PCA, we propose that the dimension is determined by the explained variance, 
as a popular method in conventional PCA. First we define the cumulative variance, which 
is a measure of the total variance of the tensor objects projected onto MPCA subspace. 



11 



Definition 2.9. Let {A, B} be a solution pair to the problem (5). We call the quantity 
^{pA) = E\\A^{X — fi)B\\'j^ the cumulative variance for X at rank-{p,q), and the quantity 

the explained percentage of total variance of X at rank-{p, q). Note that q) = E\\X — 
The corresponding sample analogues are defined to be g) = ^ X]r=i ^ 
X)B\\l, g) = i Zti - X\\l and 

${p,q) 
P{P,<1) = r- 

Remark 2.10. Note that q) > q') does not necessarily imply span(y4p ® Bg) D 
span(/lp/®i?g/). Similar phenomenon can be observed on the cumulative distribution func- 
tion. For instance, in a 2-dimensional c.d.f F, the phenomenon 'F{xi,X2) > F{x[,X2)" 
does not imply {{u,v) : u < Xi,v < X2} D {{u,v) : u<x[,v < Xg}. 

From the description below Definition 2.3, we have g) = Yl^=i-^i ~ X]j=i^j 
g) = Yl^=i^i ~ X]j=i4- Note that Aj and ^j, as well as Aj and ^j, depend on 
the specified dimensionality (p, g). Also note that g) < g) always holds. Thus, 
pip A) ^ 1 s-iid is used as a measure of adequacy for MPCA at dimensionality (p, g). 
Specifically, for a given po ^ (0, 1), consider the hypothesis test: 

Hq : p(p, g) < Po v.s. Hi : p(p, g) > po- (9) 

A rejection of Hq then indicates the chosen dimensionality (p, g) satisfies the condition 
that p(p, g) reaches the required level of explained variance at a certain confidence. To 
perform the test, a reference distribution for the sample analogue p(p, g) is required. We 
derive the asymptotic distribution of y/n{p{p, q) — p(p, g)) in Section 3, which can be used 
to construct the rejection region of the test. 



3 Asymptotic properties for MPCA 

In this section, we investigate the asymptotic behavior of MPCA. Without loss of gener- 
ality, we assume p = to simplify the notations in the rest of discussion. It then implies 
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E = E[yec{X)vec{X)'^] and the population kernel matrices of MPCA at dimensionality 
{p,q) can be simplified to be EIXPeX"^] and E[X'^ PaX]. Note also that the population 
kernel matrices of (2D)^PCA reduce to ElXX"^] and E[X'^X] in this situation. 

Let Sn be the sample covariance matrix of {vec(Xj)}"^]^, where Xj's are iid observations 
with finite second moments following model (1). By the central limit theorem, we have 

y/^iSn - S) 4 iV, (10) 

where vec(A^) is an m^-variate normal with zero-mean and covariance matrix Sjv = 
Gov (vec(X) ®vec(X)). If vec(X) is further assumed to be normally distributed, then 
Sn follows a Wishart distribution and Sat is derived to be (Anderson, 1963) 

^N = {Im^+Km,m){^®^), (11) 

where Ke^k = Y2l=i X]j=i ^ij ® ^Tj is the commutation matrix, and Hij is an i x k matrix 
with one in the («, j)*'' entry and zeros elsewhere. Some important properties involving 
commutation matrix are listed here (Magnus and Neudecker, 1979). Let Mi G 
and M2 e 9^"2x^2 be two arbitrary matrices. Then, Ka^^bi = -^^,ai' -^ai,fei-f^fei,ai = Libi, 
Ka,M = 4i if bi = 1, vec(Mn = i^,,,b,vec(Mi), and (Ms^Mi) = ir,,,,,(Mi ^Ms)/^^,,^,. 
These properties will be repeatedly used in the discussion of asymptotic theory without 
further reference. We note that, unless explicitly specified, the asymptotic properties 
derived in this section does not rely on the normality of vec(X). 

3.1 Asymptotic distributions for principal components, projec- 
tions, cumulative variance and explained variance in MPCA 

We first state the weak convergence of the cumulative variances and the tensor principal 
components of MPCA. The limiting distributions for projections and explained variance 
are direct applications of delta method. 

Theorem 3.1. Assume model (1) and, for any fixed {p,q) with p < po and q < qo, the 
leading p eigenvalues Xi{T,,p,q) 's and the leading q eigenvalues ^j(S,p, g) 's of MPCA are 
simple roots. 
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(a) For p < po and q < go/ we have the limiting distribution 



d 



D<S>ip,q) 

vec(/,„)^ 



vec(iV), 



where -D<i>(p,g) = gve(f('s) ^'^^ ^^■^ explicit expression is given in Lemma 3.2. 
(b) For p < Pq and q < q^,^ we have the limiting distribution 



(12) 



( 


vec(A) 




vec(A) 


) 




vec(i?) 




vec(5) 





(13) 



where 



(9a 1 



96i 



dbq J, 



1 T 



When {p, q) = {po, q^), Dh^^^^ has an explicit expression, which is given in Lemma 3.2. 
Lemma 3.2. Assume the model (1). 
(a) For p < Po and q < qo, we have 

D<s>{p,q) = vec(Ps^A)'^- (14) 



(b) When (p, q) = (po, go), for i 



1, ■ ■ ■ ,po and J = 1, ■ ■ ■ , go? we have 



{oi ® vec(PB„) ® {X Jp - E[XPB,X^]yf (Kp,, ® J^,) 



9vec(S) 

db, 



{b, ® vec(P^J ® (OJ, - E[X^PA,X])+f (Jp, ® i^. 



(15) 
(16) 



9vec(S) 

where, for a given matrix M, denotes its Moore-Penrose generalized inverse. 

It can be seen from Lemma 3.2 that, when (p, q) = {pq, go); the asymptotic distribution 
of A depends on B only through span(P) = span(i?o), and the asymptotic distribution of 
B depends on A only through span(A) = span(Ao). We are now on the position to obtain 
the asymptotic normality of the projection matrix onto MPCA subspace P§^x ^"^^ 
explained variance p(p, g) in the following corollaries. 



-'^For p > Po (or q > qo, rcsp.) ElXX'^] (or E[X'^X], rcsp.) has multiple roots from the (po + 1)"^ (or 

(go + l)*^j resp.) eigenvalue and beyond. 

^For either p > po, or q > qo, the {po + 1)*^, or (go + 1)"^, tensor principal components are not uniquely 

determined due to multiple characteristic roots. 
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Corollary 3.3. Under the same assumptions of Theorem 3.1. For p < po and q < qq, we 
have the limiting distribution of the projection matrix onto MFC A subspace 

^ vec{Pg^^ - Pb^a) a /^P,«,vec(iV), (17) 

where Dp^^^ = ^^^^^§f^- When {p,q) = {po,qo), ^Pb„®a„ explicit expression 

PO 



1=1 

10 



+ ® (Pb, ® [vec(P^Jvec(P^J'^] ® {^,1, - E[X'^Pa,X]Y) {I„ ® iT^,,) 

Corollary 3.4. Under the same assumptions of Theorem 3.1. For p < po and q < qo, we 

have the limiting distribution of the explained variance 

V^(P(P, q) - p{p, q)) A iV(0, (18) 
where a^^.- is defined to be 

Corollary 3.4 is the cornerstone of our asymptotic test for hypothesis (9). Before 
practical implementation of the test, however, we need a consistent estimator of 
Note that the asymptotic covariance S^v can be empirically estimated by 
1 " 

g^^i = _ ^ vec (vec(Xi - X)vec{X, - Xf - Sn) vec (vec(Xi - X)vec{Xi - Xf - Snf 

i=l 

Moreover, if vec(X) is normally distributed, we can also estimate Sjv by 

^N,2 = {Im? + Km,m){Sn ® Sn) 

based on (11). Consequently, the asymptotic variance is estimated by 

\<!>{p,q) $(p,g) J \Hp,q) $(p,g) 

for i = 1,2 (depends on the normality of vec(X) or not), where Dis,(^p^q) = vec(Pg^^)"^. 
The consistency of ^p(^pq) is a direct consequence by standard arguments. These facts 
enable us to construct an approximate level a test to determine the dimensionality (p, q) . 
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Theorem 3.5. Assume the conditions of Theorem 3.1 and {p,q) < {po,qo)- For the 
hypothesis (9), an approximated level a test is to reject Hq if 

p{p,q)>po + ^z^, (20) 

where is the upper a quantile of the standard normal. 

3.2 Asymptotic efficiency 

MPCA and (2D)^PCA actually target the same basis when {p,q) = {po,qo)- Intuitively, 
we are in favor of MPCA since its kernel matrices are less noise-contaminated than the 
ones of (2D)^PCA as mentioned previously. The following theorem proves that MPCA is 
indeed asymptotically more efficient than (2D)^PCA, wherein aCov denotes the asymp- 
totic covariance. 

Theorem 3.6. Assume the conditions of Theorem 3.1 and the normality ofyec{X). Let 
iP^Q.) = (Po^qo) (^i^d let [A*^B*) he the {2D)^ PCA components under {po,qo)- Then, 

aCov(vec(Pg.^;j.)) - aCov(vec(P5^^)) > 0, (21) 

where the equality holds if and only if [po, go) = {p, q)- 

Theorem 3.6 states that under model (1), MPCA is at most as disperse as (2D)^PCA 
in estimating the dimension reduction subspace span(i?o ® ^o)- The only case that we 
will gain nothing from MPCA over the (2D)2PCA is when (po,go) = ip,q)- Note that 
the condition [pq, go) = {p, q) implies that there is no room of dimension reduction at all 
and is probably of no interest in real applications. Consequently, Theorem 3.6 provides a 
justification of using MPCA. 

4 Experimental study: the Olivetti Faces data set 

We test and compare the performance of MPCA and conventional PCA on Olivetti Faces 
data set, which is available at http://www.cs.nyu.edu/~roweis/data.html. This data 
set consists of 400 gray scale (8 bits) face images of 64 x 64 pixels. There exist different 
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facial expressions and/or views for each individual in this data set. A simulation experi- 
ment is designed as follows. 400 face images are randomly partitioned into a training set 
with size 100 and a test set with size 300. This 100-300 partition, where the training set 
is smaller than test set, is to reflect a scenario of using a small portion of data to train a 
basis set for the representation of the rest data in data archive. 

Both MPCA and conventional PCA are applied on the 100 training images to produce 
image basis which is used to reconstruct the rest 300 test images. The average of the 
100 training images, named mean face, has been subtracted from all the 400 images for 
PCA training and for test image reconstructions as well. The mean face is finally added 
to the reconstructions at the last stage to show the resulting images. 500 replicates of 
training-test partitions are performed to compare the mean test error, which is defined 
as the average of the Frobenius norm between the original images and the reconstructed 
images on test data set. The result is in Table 2. The mean test error for conventional 
PCA is more than seven times of that for MPCA; and the standard deviation is more 
than 12 times. 



Frobenius- Norm 


MPCA Conventional PCA 


Mean (xlO^) 


1.1346 8.6455 


SD (xl02) 


9.6398 120.39 



Table 2: Test error comparison for MPCA and conventional PCA on the test images of Olivetti 
Faces data set. The error is defined as the Frobenius norm of two image matrices: original 
test image and its reconstruction using 28 x 28 principal components. 

In Figures 1-3, 40 test images are randomly chosen from the test set to show the 
visual performance of image reconstructions by these two PCA schemes. In MPCA, 28 
row eigenvectors and 28 column eigenvectors, both with size 64, are used to generate 784 
basis images, of which the 100 leading ones are shown in Figure 4. We remind the reader 
that the selection {p, q) = (28, 28) produces an p(p, q) value 0.968. Based on Theorem 3.5, 
a one-sided 95% confidence interval for p(28,28) is given by [0.967, 1]. We also show the 
variabihty pattern plots (Tu and Huang, 2011) in Figure 7. These plots present the average 
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variations (absolute values) of the eigenvectors for the bootstrap re-sampled data, from 
those eigenvectors for the original data. The horizontal and vertical indices refer to the 
eigenvector indices for re-sampled and original data. The indices of eigenvectors are sorted 
by eigenvalues. The variations are presented by colors from dark blue for perfect matched, 
to dark red for extremely deviated. Usually, eigenvectors with distinct eigenvalues show 
deep blue on the diagonal and deep red on the off-diagonal. Eigenvectors with the same 
multiple root eigenvalue tend to be visualized by a cubic pattern on their correspondence 
indices. It can be seen that our choice of (p, q) = (28, 28) does not produce multiple roots, 
since the bootstrapped variability of the solutions at this selection is quite small. 

In conventional PCA, 784 (= 28 x 28) eigenvectors (basis images) with size 4096 are 
used, of which the 100 leading ones are shown in Figure 5. Because of using 100 training 
images with average subtraction, there are at most 99 meaningful eigenvectors in the 
conventional PCA. The rest are randomly orthogonal eigenvectors with zero eigenvalue 
from the remaining subspace. In Figure 5, from top to bottom, we can see the images 
with clear facial shape to vague ones and a random image on the 100**^ one. On the other 
hand, MPCA tends to distribute the image characteristics to more basis elements which 
may allow for more local modification on the images. 

In Figure 6, one particular image among the 40 test images is chosen to demonstrate 
the performance of these two methods. The top row shows the image reconstruction 
process for MPCA when more basis elements are added in, and the bottom shows for 
conventional PCA. The mean face is put in the first column and the target image in 
the 7*^ column as references. The right-most column shows the absolute values of pro- 
jection scores on the leading 784 basis elements. It is clear that the conventional PCA 
concentrates on no more than 99 basis elements while the MPCA spreads out to much 
more basis elements. For MPCA, the image turns its view when 10 basis elements are 
used; the pupil turns to left when 16 x 16 basis elements are used; the double eyelid and 
nostrils show up when 22 x 22 basis elements are used; the facial curves become clear 
when 28 x 28 basis elements are used. While we can observe the reconstruction progress 
by adding more basis elements for MPCA, we do not see much difference after 100 basis 
elements for conventional PCA. It is clear that MPCA performs better than conventional 
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PCA in reconstructing the test images from Table 2 and these figures. 

5 Concluding discussions 

PCA is a popular tool to reduce the dimensions for high dimensional data analysis; MPCA 
could be likely to serve the similar function for higher order tensor data sets. From this 
work, the statistical properties of MPCA become clear through the theoretical framework 
and the performance of MPCA is predictable through the asymptotic results. Most im- 
portantly, based on these asymptotic results, various hypothesis tests become feasible for 
subsequent analysis, including pattern recognition or classification. Our work, though 
technically theoretical, may construct a platform to expand the application potentials of 
MPCA. 

The advantages of MPCA over conventional PCA on tensor structure data are evident 
in the Olivetti Faces data example. Therein, conventional PCA suffers seriously from the 
large m and small n problem such that there can be at most n — 1 meaningful eigenvectors. 
This makes it unavoidable that all the data noises are still carried by the chosen principal 
components. Furthermore, too concentrated information in one component, which may 
not be good for pattern recognition or classification prediction. On the other hand, MPCA 
distributes the information to more components which may allow local modification in 
the process of image reconstruction, with even fewer free parameters. The key point for 
the good performance of MPCA is the data tensor structure. For practical purposes, the 
robustness of MPCA over model variety should be further investigated. 
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Appendix 

Proof of Proposition 2.2. In the maximization problem (5), the objective function is 
continuous and the feasible region Og^g ® Op^p is compact. (Both continuity and com- 
pactness are with respect to the topology induced by Frobenius norm.) Thus, solution(s) 
exists. □ 

Proof of Proposition 2.5. (a) Let [B, B±] be a g x g orthonormal matrix. Since Bq G 
span([5, Ba_]), there exists r/i G 3?^''""' and r/2 G gft(9-9)x'?" such that Bq = Br]i + B±r]2. As 
BqBq = Jgy, we have 77^771 -|- 77^772 = I go- Observe that 

E\\A^{X - fi)B\\l 
= E {trace{A^{AoUB^)BB^{BoU^A^)A) } + E {ii&ce{A^ eBB'^ A) ] 
= E {tia.ce{A^ AoUrf[r]iU^ AlA) ] + pga^ 

= E {ii8.ce{A^AoUU^AlA) ] - E {trace(A^Ao?7r/Jr/2f/^A;^A) } + pga^ 

< E{ti8.ce{A^AQUU^AlA)]+Pqa'^, (A.l) 

where the equality in (A.l) holds if and only if 772 = 0, if and only if 7/^r/i = Ig^. Thus, 
if g > Qoi such an 7/1 (with rank go) exists to ensure the equality in (A.l). This implies 
Bq = Brji and, hence, Bq G span(i?). Similarly, Aq G span(yl) which establishes (a). 
To show (b), when g > go, from (a) we have Bq G span(i?) and 

max E\\A^(X - u)B\\l = max tiacei A'^EIAqUU^ A^U) + pqa^, (A.2) 

which is an eigenvalue-problem for the matrix E[AqUU'^ Aq]. By diagonalizing E[UU'^] = 
TuAu^u^ then, EIAqUU^A^] has Po non-zero eigenvalues A^/ with the corresponding 
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eigenvectors AqTu. When p < pq, the maximizer A consists of the first p columns of 
AqTu and, hence, span(A) C span(Ao). 

(c) can be estabhshed in a similar way as (b). 

To show (d), observe that 

E\\A^{X - i2)B\\l = E {tTace{A^AoUB^BB^BoU'^A^A)} +pqa^. (A.3) 

To maximize (A.3) over A G Op^p, B G Oq^q with p < po and q < go, the rank of A'^Aq 
and B^Bq must be p and q, respectively, in order to attain the maximal value. This can 
happen only if span(A) C span(y4o) and span(i?) C span(i?o)- D 

Proof of Proposition 2.6. We will only provide a proof for (a), and (b) can be ob- 
tained in a similar way. li q > qo, from Proposition 2.5 (a) we have span(i?o) C span(i?), 
which further implies that 

E[{X - fi)PB{X - fif] = E[AoUU^A^]+E[ePBe^], (A.4) 
E[{X - ij,){X - ij,f] = E[AoUU^A'^]+ E[ee'^]. (A.5) 

Note that E[ePBe'^] = qa'^Lp and E[ee'^] = qa'^Ip. Hence, E[{X - fi){X - fi)'^] and 
E[{X — fi)PB{X — fi)'^] have the same leading po A p eigenvectors as E[AoUU'^Aq] has. 
Moreover, we have Xi = di + qa"^ and A* = di + qa"^, where di is the i^^ eigenvalue of 
E[AqUU'^Aq]. Hence, A* — Aj = (g — g)cr^ for i = 1, . . . ,p, which completes the proof. □ 

Proof of Theorem 3.1. Let Hp^q{Sn) = (vec(A)'^, vec(5)^)'^ be the function maps Sn 
to its tensor principal components under {p, q), which gives Hp^qiY?) = {yec{A)^, yec{B)^)^ 
and Dh,^, = StS^- Note that = vec(/„)^vec(5„) and <l>(p,g) = vec(/„)^vec(S). 

From the weak convergence ^/n{Sn — T^) A- N and an application of the delta method, 
we have, for {p,q) < {po,qo), 

-<l>(p,g)) A D^^p^q)yec{N), 
V^{Hp,q{Sn)-Hp,q{J:)) A DH,,,yec{N). 

The explicit forms of Dt^i^j q^ and elements in -Dj^p^ are provided in Lemma 3.2. □ 
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Proof of Lemma 3.2. For a given pair (p, q) with 1 < p < p and 1 < g < g, we have, 
from (6) and (7), that A and B satisfy the following system of stationary equations 



y^X^j ® ^pV^i^j ® Q ) = ^iO-i^ 2 = 1, ■ ■ ■ ,p, 
3 = 1 ) 
V \ 

^(/y ® ai)^S(/g (g) Oi) hj = ^jbj, j = !,■■■ , g, 



where ai,bj,Xi,^j depend on (E,p, g). The indices z,j in the above system of equations 
can go beyond p and q and up to p and q. But those Oj and bj with i > p and j > q 
will not be included in the solution pair (A, B). Note that we have the following identity, 
which is due to the definition of $ and the stationary equations: 

p q 
HP,q) = J2X^{^,P,q) = (A.6) 

i=i j=i 

We will use the perturbation method (Sibson, Lemma 2.1, 1979; Fine, 1987) to derive 
the derivatives D^i^pq), o^cIj:) (9vec(s) • Suppose that S is perturbed to = S + eS. 
Denote the corresponding system of stationary equations with by 



^{bj,e ® IpY'L^{bj^^ O Ip) j Ci,, = Ai,, Oi,,, z = 1, ■ ■ ■ ,p, (A.7) 
® ai,e)^Se(/g ® aiM = J = 1, ■ ■ ■ , (A.8) 



i=l 



Let their first order expansions be denoted by 

K,e = Ai + eXi + o(e), a^^e = + ea^ + o(e), 

= + + = hj + efej- + o(e). 

Following the same arguments as in Lemma 2.1 of Sibson (1979) and by equating the 
terms involving e in (A.7) we have, for z = 1, ■ ■ ■ ,p, 

Xi = aJtBtti, (A. 9) 

Ai/p-5^(6,®/p)^S(6,®/p)| tsau (A.IO) 
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where 

= E[X(5E^ + 55^)X^] + ^(6, (A.ll) 

i=i 

Since 5f 5, = J^, then 5 = [ 6i, ■ ■ ■ , 6g ] must satisfy B + fi^fi = 0. 

(a) For (p, g) < (po, Q'o)) the first term of Yl^=i (^J^B^'i can be expressed as 

p q 
J2 ajE[X{BB^ + BB^)X^]ai = ^ (bj E[X^ PAX]bj + 6JE[X^Pa^]6,) 
i=i j=i 

which vanishes by noting that bj is an eig envector of E[XPaX^] and bjbj = bjbj = 0. 
This concludes that Yli=i -^i — Ylt=i (^S^=i(^i ® ^p)^^i^j ® -^p) j '^i ^^d, hence, 

P Q 

^HM) = XI ^^^^ ® Oi <8) fej ® Oi)^ = vec(PB55A)^- (A. 12) 

(b) Assume now {p,q) = (po^Qo)- To derive the form of -q^^i we are going to show 
that the first term of is zero and conclude = ® ^p)'^^i^j ® -^p)- This 
together with (A. 10) gives 

PI go 

= J2{bj®a,^b,^{XJp-E[XPBX^]}+f 

= {a, ® vec(PijJ ® (A,/p - E[XPb„X^]) + }^ (irp,^ ® /p,). (A.13) 

as desired, where the second equality follows from Proposition 2.5 that span(P) = span(Po) 
when q = qo- To complete the proof, first note that Proposition 2.5 ensures the existence 
of a nonsingular matrix rj such that Bq = Brj. From X = AqUBq + e (remember /i = 0) 
and the independent structure of U and e, we can represent the first term of as 

E[X{BB^ + BB'^)X^] = E[AoUB^{BB^ + BB'^)BoU^ A^] + a\ia.ce{B'^ 13 + 13'^ B)Ip 

= E[AqU7]'^ {B^ B + B^B)r]U^A^] + ahrace{B'^B + B^B)Ip. 

The proof is completed by noting that B^B + B-'^B = 0. The case of ^^^^^^-^ can be 
established in a similar way. □ 
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Proof of Corollary 3.3. Consider the function S) = Pb^a with the correspond- 
ing differential Dp(^A,B)- From Theorem 3.1 (b) and delta method, we have 



( 


vec(A) 




vec(A) 


) 




vec(fi) 




vec(fi) 





^b^a) = V^yec{FiA,B)-F{A,B)) 
= Df{a,b) 
A Dp,^,vec(iV), 

where Dp^^^ = Df{a,b)Dhp,^- When [p, q) = {po, qo), the expression of Dp^^^ is obtained 
by a direct calculation together with Lemma 3.2 (b) and Theorem 2.5 (a). □ 

Proof of Corollary 3.4- Consider the function F{x,y) = x/y with the corresponding 
differential Dp(^^^y-^ = {y~^, —xy~'^)'^. From Theorem 3.1 (a) and delta method, we have 

= ^F($(p,g),cl>(p,g)) 



n 



( 


q) 




<l>(p, q) 


) 













n 



A D 



F{'i>{p,q)MP,<l)) 



D 



<I>(p,g) 



vec(/„ 



vec(A^). 



A direct calculation gives the expression of the asymptotic variance o"p(pq)- 



□ 



Proof of Theorem 3.5. Under Hq, we have from Corollary 3.4 that, for n large enough. 



The consistency of a^^-^ and Slutsky's theorem complete the proof. 



□ 



Proof of Theorem 3.6. Since (p, g) = (po)Q'o)) we have A = A* and B = B* from 
Theorem 2.6, and span(A) = span(ylo) and span(i?) = span(i?o) from Theorem 2.5 (a). 
Let {tti : i > po} and {bj : j > go} be orthogonal bases of span((5Ao) ^P^^^Qbo), 
^B,q' = E?Li(^i ® ® bj (g) Ipf, g' = 1, ■ ■ ■ , g and Wa,p' = YfLi^k ®ai®Iq®a. 



T 
ij 1 



p' = ,p. Also define Ma = [(ai ® Mai), ■■■ , (opo ® M^pJ]^ and = [{h 

Mbi),--- ,(&go ® MBq,)f, where M^^ = {A,/p - E[XPb,X^]}+ and M^,- = {QIg 
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ElX"^ PaqX]}^ . By using these notations and from Theorem 3.1 (b), we have the limiting 
distribution of MFC A 



Vn(vec(A, B) - vec(A, B)) 4 MoWoyec{N) 

\ Ma 

where Mq 

bution of (2D)2PCA is derived to be 



Ma 





,Wo = 











Mb 







(A.14) 

. By Lemma A.l below, the limiting distri- 



where W( 



0+ 



^/^{^/ec{A*, B*) - vec(A*, B*)) 4 Mo{Wo + Wo+)yec{N), 



(A.15) 



Wa,po+ = Wa,p - Wa,po, and WB,g^+ = WB,g - W^B.go- 

Note that A = A* and B = B*. To complete the proof, by an application of delta 
method, it thus suffices to show 

aCov(vec(l*, B*)) - aCov(vec(A B)) > 0. (A. 16) 

From (A.14)-(A.15) we are left to show 

Mo{Wo^nW^^ + Wo+^nW;[ + Wo+^nW^^)M^ > 0, (A.17) 

where Eat = Cov(vec(A^)) = (/m2 + Krn,m)(X' ® ^) under normality of vec(X). We are 
going to show MqWo^nW^^M^ = 0. This together with the fact MoWq+^nW^^M^ > 
then establishes the desired result. Observe that 



MoWo^nW^,M^ 



MAWB,g,^NWl^^^...^ 



Ml HAWB,g,^NWl^^^M^ 



HbWa,p,'^nWI^^^MI HbWa,pJ^nWI^^^MI 
From model (1), S = (5o(8)v4o)(T + o-2/,„J(5o® Ao)^ + o-^QiJo^Ao, where Qbo®Aq = Qbo® 
Pao + Pbo ® Qao + Qbo ® Qao- This implies WB,qoT.NWBg^^ = and Wa,poSa^W^J,po+ = 
and, hence, the diagonal elements of the above matrix vanish. For the off-diagonal 
elements, the same reasoning can be used to deduce that {MAWB,qo)^NW^.p^_^_ = and 
{MBWA,po)^NWB,qo+ — 0' which establishes (A. 16). A direct calculation further gives 



MoWo+J^nW^^M^ = a' 



{q-qo)MA{Ip2 + Kp^p)Ml 

(p-Po)Mb(/,2 + K,JM^ 

which equals a zero matrix if and only if (po, %) = {p, g)- 



□ 



26 



Lemma A.l. Assume model (1) and assume that the leading eigenvalues {A* : i = 
1, • ■ ■ ,po} and {^* : j = 1, ■ ■ ■ , go} of (2DpPCA are simple roots. Then, the differentials 
of (2D)'^PCA components with respect to S under {p,q) = (po^Qo) ore given by 

Proof. We only derive the differential of A*, where the case of B* is similarly obtained. 
Remember that (2D)^PCA components A* are leading eigenvectors of K^* = E[XX'^] 
with eigenvalues A*. A standard argument (Sibson, 1979) then gives 



dyec{a*) 



dyec{KA*) 

= aJ®MM, (A.19) 

where the second equality follows from Theorem 2.6 with M^i being defined in Theo- 
rem 3.6. Turning to the differential of Ka* with respect to S. It is always true that 

Ka* = ® Ipf^ibj ® Ip), 

where {bj : j > go} are defined in the beginning of Theorem 3.6. Thus, we have 

From (A.19)-(A.20) and the chain rule, the proof is completed. □ 



27 




28 




29 




30 



L 



1 1 II u III! in II [III luu 
































Figure 4: Leading 100 MPCA basis images. 
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Figure 5: Leading 100 conventional PCA basis images. 
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Figure 6: The reconstructed images for the test face by adding more basis elements are 
compared for MPCA (top row) and conventional PCA (bottom row). Both mean face and the 
target face are put in this figure as references. Plots in the right-most column are projection 
coefficients (in absolute value) onto PCA subspace. 




Figure 7: The variability pattern plots for MPCA for 1 < p < 28 and 1 < g < 28. 
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